RNA-Seq Data Analysis ◾ 185
design matrix using the sample information in the “sampleinfo.txt” that we have created
above. In edgeR, the design matrix can be defined with or without an intercept. An intercept
is used when one level of the condition serves as the reference (baseline) against which the
other levels are compared. When the design matrix is defined without an intercept, each
level gets its own coefficient and the differential analysis is performed by specifying a
contrast between the levels, as we will do. In the following, we define a design matrix
without an intercept (Figure 5.9):
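# Treat the condition column of the sample sheet as a factor
condition <- factor(sampleinfo$condition)
# "~ 0 +" drops the intercept, giving one column per condition level
design <- model.matrix(~ 0 + condition)
design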
This design matrix defines two dummy variables, one for each level of the condition
studied (1 if the sample belongs to that level and 0 otherwise). When we fit the negative
binomial generalized log-linear model described in Formula 22, two coefficient estimates
will be calculated, one for each dummy variable.
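The difference between the two kinds of design matrix can be seen on a small example. The following is a minimal sketch in base R; the labels "control" and "treated" are hypothetical stand-ins for the levels stored in sampleinfo$condition, and the contrast vector is written by hand for illustration only.

```r
# Hypothetical condition labels; the real ones come from sampleinfo$condition.
condition <- factor(c("control", "control", "treated", "treated"))

# Without an intercept: one indicator (dummy) column per level.
design_no_int <- model.matrix(~ 0 + condition)
colnames(design_no_int)

# With an intercept: the first level ("control") becomes the reference,
# and the remaining column measures treated relative to control.
design_int <- model.matrix(~ condition)
colnames(design_int)

# For the no-intercept design, the comparison of interest is stated
# explicitly as a contrast: treated minus control.
contrast <- c(conditioncontrol = -1, conditiontreated = 1)
contrast
```

In the no-intercept form every sample has exactly one 1 in its row, so each coefficient corresponds to the fitted level of one condition. A contrast vector such as the one above is then what selects the comparison to test. In practice such a vector is often built with limma's makeContrasts() rather than by hand.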
5.3.7.4 Filtering Low-Expressed Genes
Some genes may not be expressed or may not have enough reads to contribute to the
differential analysis. Therefore, it is good practice to retain only the genes that have
sufficient read counts by filtering out the genes with zero or low counts. A common rule is
to keep only the genes with at least one count per million (1 CPM) reads in at least two
samples. The following script uses edgeR's filterByExpr() function, which applies a similar
CPM-based rule with thresholds derived from the library sizes and the design matrix, to
filter out the genes with low abundance, and then adjusts the library sizes to reflect the
change:
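# Logical vector: TRUE for genes with enough reads to be kept
keep <- filterByExpr(y, design)
# Subset the genes; keep.lib.sizes=FALSE recomputes the library sizes
y <- y[keep, , keep.lib.sizes=FALSE]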
As shown in Figure 5.10, after filtering, the counts slot contains only genes with sufficient
abundance and the library sizes in the samples slot have been recomputed. Notice the
difference in the number of genes and in the library sizes between Figures 5.10 and 5.6. The
new counts slot contains only 133 genes, compared to 632 genes before filtering, and the
library sizes now reflect only the reads of the retained genes.
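To make the "1 CPM in at least two samples" rule concrete, the following minimal sketch applies it by hand in base R. The count matrix is invented toy data, and all object names (counts, cpm_mat, keep_cpm, and so on) are hypothetical. It is not a reimplementation of filterByExpr(), which chooses its own thresholds.

```r
# Toy count matrix (invented data): 4 genes x 4 samples.
counts <- matrix(
  c(500000, 400000, 600000, 450000,
    499990, 599995, 400000, 549999,
        10,      5,      0,      0,
         0,      0,      0,      1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:4))
)

# Library size = total reads per sample.
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, times 1e6.
cpm_mat <- sweep(counts, 2, lib_size, "/") * 1e6

# Keep genes reaching at least 1 CPM in at least two samples.
keep_cpm <- rowSums(cpm_mat >= 1) >= 2

# Subset the genes and recompute the library sizes from what remains.
filtered <- counts[keep_cpm, , drop = FALSE]
new_lib_size <- colSums(filtered)

keep_cpm
new_lib_size
```

Here geneD has a single read in one sample and is dropped, so the library size of that sample shrinks by one read. This is the same kind of adjustment that keep.lib.sizes=FALSE performs on the DGEList object.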
FIGURE 5.9 Design matrix without intercept.